    Towards directly modeling raw speech signal for speaker verification using CNNs

    Speaker verification systems traditionally extract and model cepstral features or filter bank energies from the speech signal. In this paper, inspired by the success of neural network-based approaches that directly model the raw speech signal for applications such as speech recognition, emotion recognition and anti-spoofing, we propose a speaker verification approach where speaker discriminative information is learned directly from the speech signal by: (a) first training a CNN-based speaker identification system that takes the raw speech signal as input and learns to classify speakers (unknown to the speaker verification system); and then (b) building a speaker detector for each speaker in the speaker verification system by replacing the output layer of the speaker identification system with two outputs (genuine, impostor) and adapting the system discriminatively with enrollment speech of the speaker and impostor speech data. Our investigations on the Voxforge database show that this approach can yield systems competitive with state-of-the-art systems. An analysis of the filters in the first convolution layer shows that the filters emphasize information in low frequency regions (below 1000 Hz) and implicitly learn to model fundamental frequency information in the speech signal for speaker discrimination.
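
    A minimal PyTorch sketch of the two-stage recipe above; the layer sizes, kernel widths and names are illustrative assumptions rather than the paper's exact configuration:

        import torch
        import torch.nn as nn

        class RawSpeechCNN(nn.Module):
            """CNN mapping a raw waveform to class posterior scores."""
            def __init__(self, n_classes):
                super().__init__()
                self.features = nn.Sequential(
                    nn.Conv1d(1, 32, kernel_size=300, stride=100),  # acts as a learned filterbank
                    nn.ReLU(),
                    nn.Conv1d(32, 64, kernel_size=5, stride=2),
                    nn.ReLU(),
                    nn.AdaptiveAvgPool1d(1),
                )
                self.output = nn.Linear(64, n_classes)

            def forward(self, x):  # x: (batch, 1, n_samples)
                return self.output(self.features(x).squeeze(-1))

        # Stage (a): speaker identification on speakers unknown to the
        # verification system, trained with cross-entropy.
        model = RawSpeechCNN(n_classes=300)

        # Stage (b): per enrolled speaker, replace the output layer with two
        # outputs (genuine, impostor) and adapt the network discriminatively
        # on the speaker's enrollment speech plus impostor speech data.
        model.output = nn.Linear(64, 2)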

    Long Term Spectral Statistics for Voice Presentation Attack Detection

    Automatic speaker verification systems can be spoofed through recorded, synthetic or voice-converted speech of target speakers. To make these systems practically viable, the detection of such attacks, referred to as presentation attacks, is of paramount interest. In that direction, this paper investigates two aspects: (a) a novel approach to detect presentation attacks where, unlike conventional approaches, no assumptions are made about the speech signal; instead, the attacks are detected by computing first-order and second-order spectral statistics and feeding them to a classifier, and (b) generalization of presentation attack detection systems across databases. Our investigations on the Interspeech 2015 ASVspoof challenge dataset and the AVspoof dataset show that, compared to approaches based on conventional short-term spectral processing, the proposed approach with a linear discriminative classifier yields a better system, irrespective of whether the spoofed signal is replayed to the microphone or injected directly into the system software process. Cross-database investigations show that neither the short-term spectral processing-based approaches nor the proposed approach yield systems that generalize across databases or methods of attack, revealing the difficulty of the problem and the need for further resources and research.
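
    A sketch of this detector, assuming the first- and second-order statistics are the per-frequency-bin mean and standard deviation of the log-magnitude spectrum over the utterance; the frame length and classifier setup are illustrative:

        import numpy as np
        from scipy.signal import stft
        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

        def long_term_spectral_statistics(signal, fs, nperseg=512):
            """First- and second-order statistics of the log spectrum over time."""
            _, _, spec = stft(signal, fs=fs, nperseg=nperseg)
            log_mag = np.log(np.abs(spec) + 1e-10)          # (freq_bins, frames)
            return np.concatenate([log_mag.mean(axis=1),    # first-order statistic
                                   log_mag.std(axis=1)])    # second-order statistic

        # With labels y (1 = bona fide, 0 = presentation attack):
        # X = np.stack([long_term_spectral_statistics(s, 16000) for s in signals])
        # clf = LinearDiscriminantAnalysis().fit(X, y)
        # scores = clf.decision_function(X_test)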

    AudioPaLM: A Large Language Model That Can Speak and Listen

    We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech, with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits from AudioLM the capability to preserve paralinguistic information such as speaker identity and intonation, and from PaLM-2 the linguistic knowledge present only in text-based large language models. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems on speech translation tasks and can perform zero-shot speech-to-text translation for many languages whose input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt. We release examples of our method at https://google-research.github.io/seanet/audiopalm/examples
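
    The abstract describes a single model whose vocabulary covers both text and audio tokens, initialized from a text-only model; a hedged sketch of that initialization step, with all sizes and names as illustrative assumptions:

        import torch
        import torch.nn as nn

        text_vocab, audio_vocab, d_model = 32000, 1024, 512  # illustrative sizes

        # Embedding table of a pretrained text LLM (stand-in for PaLM-2 weights).
        text_embeddings = nn.Embedding(text_vocab, d_model)

        # Unified table: text rows are copied from the text model; rows for
        # audio tokens (e.g. AudioLM-style tokens) are freshly initialized.
        unified = nn.Embedding(text_vocab + audio_vocab, d_model)
        with torch.no_grad():
            unified.weight[:text_vocab] = text_embeddings.weight

        # A decoder over this extended vocabulary can then read and write
        # mixed sequences of text and audio tokens in a single model.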

    Overview of BTAS 2016 Speaker Anti-spoofing Competition

    This paper provides an overview of the Speaker Anti-spoofing Competition organized by the biometrics group at Idiap Research Institute for the IEEE International Conference on Biometrics: Theory, Applications, and Systems (BTAS 2016). The competition used the AVspoof database, which contains a comprehensive set of presentation attacks, including: (i) direct replay attacks, in which genuine data is played back using a laptop and two phones (a Samsung Galaxy S4 and an iPhone 3G); (ii) synthesized speech replayed with a laptop; and (iii) speech created with a voice conversion algorithm, also replayed with a laptop. The paper states the competition goals, describes the database and the evaluation protocol, discusses the solutions for spoofing or presentation attack detection submitted by the participants, and presents the results of the evaluation.

    Trustworthy speaker recognition with minimal prior knowledge using neural networks

    The performance of speaker recognition systems has improved considerably in the last decade. This is mainly due to the development of Gaussian mixture model-based systems and in particular to the use of i-vectors. These systems handle noise and channel mismatches relatively well and yield a low error rate when confronted with zero-effort impostors, i.e. impostors using their own voice but claiming to be someone else. However, speaker verification systems are vulnerable to more sophisticated attacks, called presentation or spoofing attacks. In that case, the impostor presents a fake sample to the system, which can either be generated with a speech synthesis or voice conversion algorithm or be a previous recording of the target speaker. One way to make speaker recognition systems robust to this type of attack is to integrate a presentation attack detection system. Current methods for speaker recognition and presentation attack detection are largely based on short-term spectral processing, which has certain limitations. For instance, state-of-the-art speaker verification systems use cepstral features, which mainly capture vocal tract system characteristics, although voice source characteristics are also speaker discriminative. In the case of presentation attack detection, there is little prior knowledge to guide us in differentiating bona fide samples from presentation attacks, as both are speech signals that carry the same high-level information, such as the message, speaker identity and information about the environment. This thesis focuses on developing speaker verification and presentation attack detection systems that rely on minimal assumptions. Towards that, inspired by recent advances in deep learning, we first develop speaker verification approaches where speaker discriminative information is learned from raw waveforms using convolutional neural networks (CNNs). We show that such approaches are capable of learning both voice source-related and vocal tract system-related speaker discriminative information and yield performance competitive with state-of-the-art systems, namely i-vector- and x-vector-based systems. We then develop two high-performing approaches for presentation attack detection: one based on long-term spectral statistics and the other based on raw speech modeling using CNNs. We show that these two approaches are complementary and make speaker verification systems robust to presentation attacks. Finally, we develop a visualization method inspired by the computer vision community to gain insight into the task-specific information captured by the CNNs from raw speech signals.

    End-to-End Convolutional Neural Network-based Voice Presentation Attack Detection

    Development of countermeasures to detect attacks performed on speaker verification systems through the presentation of forged or altered speech samples is a challenging and open research problem. Typically, this problem is approached by extracting features through conventional short-term speech processing and feeding them to a binary classifier. In this article, we develop a convolutional neural network-based approach that learns both the features and the binary classifier from the raw signal in an end-to-end manner. Through investigations on two publicly available databases, namely ASVspoof and AVspoof, we show that the proposed approach yields systems comparable to or better than state-of-the-art approaches for both physical access attacks and logical access attacks. Furthermore, the approach is shown to be complementary to a spectral statistics-based approach, which, like the proposed approach, makes no prior assumptions about the speech signal.
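
    Since the two detectors are reported to be complementary, one simple way to combine them is score-level fusion; a sketch, with the normalization scheme and fusion weight as illustrative assumptions:

        import numpy as np

        def fuse_scores(cnn_scores, ltss_scores, weight=0.5):
            """Weighted sum of z-normalized detector scores (higher = bona fide)."""
            def znorm(s):
                return (s - s.mean()) / (s.std() + 1e-10)
            return weight * znorm(cnn_scores) + (1.0 - weight) * znorm(ltss_scores)

        # decisions = fuse_scores(cnn_scores, ltss_scores) > threshold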

    On Learning to Identify Genders from Raw Speech Signal Using CNNs

    Automatic Gender Recognition (AGR) is the task of identifying the gender of a speaker given a speech signal. Standard approaches extract features such as fundamental frequency and cepstral features from the speech signal and train a binary classifier. Inspired by recent work in automatic speech recognition (ASR), speaker recognition and presentation attack detection, we present a novel approach where the relevant features and the classifier are jointly learned from the raw speech signal in an end-to-end manner. We propose a convolutional neural network (CNN)-based gender classifier that consists of: (1) convolution layers, which can be interpreted as a feature learning stage, and (2) a multilayer perceptron (MLP), which can be interpreted as a classification stage. The system takes the raw speech signal as input and outputs gender posterior probabilities. Experimental studies conducted on two datasets, namely AVspoof and ASVspoof 2015, with different architectures show that, even with simple architectures, the proposed approach yields a better system than the standard approach based on acoustic features. Further analysis of the CNNs shows that they learn formant and fundamental frequency information for gender identification.
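
    The kind of filter analysis mentioned above can be approximated by summing the magnitude responses of the learned first-layer kernels; a sketch assuming a Conv1d first layer as in the models above (names are illustrative):

        import numpy as np

        def cumulative_filter_response(conv_weights, fs=16000, n_fft=1024):
            """Summed magnitude response of first-layer Conv1d kernels.

            conv_weights: array of shape (n_filters, 1, kernel_size), e.g.
            model.features[0].weight.detach().numpy().
            """
            kernels = conv_weights[:, 0, :]                    # (n_filters, kernel_size)
            responses = np.abs(np.fft.rfft(kernels, n=n_fft))  # per-filter responses
            freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
            return freqs, responses.sum(axis=0)

        # Peaks in the cumulative response indicate the frequency regions the
        # CNN emphasizes, e.g. fundamental frequency and formant ranges.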

    Understanding and Visualizing Raw Waveform-based CNNs

    Directly modeling raw waveforms with neural networks for speech processing is gaining more and more attention. Despite its varied success, a question that remains is: what kind of information do such neural networks capture or learn from the speech signal for different tasks? Such an insight is not only interesting for advancing those techniques but also for better understanding speech signal characteristics. This paper takes a step in that direction: we develop a gradient-based approach to estimate the relevance of each input speech sample to the output score. We show that analysis of the resulting "relevance signal" through conventional speech signal processing techniques can reveal the information modeled by the whole network. We demonstrate the potential of the proposed approach by analyzing raw waveform CNN-based phone recognition and speaker identification systems.
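
    A minimal sketch of the gradient-based relevance estimate described above, assuming a PyTorch model that maps a raw waveform to per-class output scores:

        import torch

        def relevance_signal(model, waveform, class_idx):
            """Gradient of one output score with respect to each input sample."""
            x = waveform.clone().requires_grad_(True)  # shape (1, 1, n_samples)
            model(x)[0, class_idx].backward()
            # The gradient has the same length as the input; analyze it with
            # conventional speech signal processing (e.g. spectral analysis).
            return x.grad[0, 0]

        # rel = relevance_signal(model, waveform, class_idx=speaker_id)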